Abstract

Walley's Imprecise Dirichlet Model (IDM) for categorical i.i.d. data extends
the classical Dirichlet model to a set of priors. It overcomes several
fundamental problems from which other approaches to uncertainty suffer. Yet,
to be useful in practice, one needs efficient ways of computing the imprecise
(robust) sets or intervals. The main objective of this work is to derive
exact, conservative, and approximate robust and credible interval estimates
under the IDM for a large class of statistical estimators, including the
entropy and mutual information.
1 Introduction

This work derives interval estimates under the Imprecise Dirichlet Model (IDM)
[Wal96] for a large class of statistical estimators. In the IDM one considers an
i.i.d. process with unknown chances^1 $\pi_i$ for outcome $i \in \{1,...,d\}$.
The prior uncertainty about^2 $\pi = (\pi_1,...,\pi_d)$ is modeled by a set of
Dirichlet priors^3 $\{p(\pi) \propto \prod_i \pi_i^{s t_i - 1} : t \in \Delta\}$,
where^4 $\Delta := \{t : t_i \ge 0\ \forall i,\ t_+ = 1\}$, and $s$ is a
hyper-parameter, typically chosen between 1 and 2. Sets of probability
distributions are often called Imprecise probabilities, hence the name IDM for
this model. We avoid the term imprecise and use robust instead, or capitalize
Imprecise. The IDM overcomes several fundamental problems from which other
approaches to uncertainty suffer [Wal96]. For instance, the IDM satisfies the
representation invariance principle and the symmetry principle, which are
mutually exclusive in a pure Bayesian treatment with proper prior [Wal96].^5
The counts $n_i$ for $i$ form a minimal sufficient statistic of the data of size
$n = \sum_i n_i$. Statistical estimators $F(n)$ usually also depend on the chosen
prior, so a set of priors leads to a set of estimators $\{F_t(n) : t \in \Delta\}$.
For instance, the expected chances $E_t[\pi_i] = \frac{n_i + s t_i}{n+s} =: u_i(t)$
lead to a robust interval estimate
$[\frac{n_i}{n+s}, \frac{n_i+s}{n+s}] \ni E_t[\pi_i]$.
Robust intervals for the variance $Var_t[\pi_i]$ [Wal96] and for the mean and
variance of linear combinations $\sum_i \alpha_i \pi_i$ have also been derived
[Ber01]. Bayesian estimators (like expectations) depend on $t$ and $n$ only
through $u$ (and $n+s$, which we suppress), i.e. $F_t(n) = F(u)$. The main
objective of this work is to derive approximate, conservative, and exact
intervals $[\min_{t\in\Delta} F(u), \max_{t\in\Delta} F(u)]$ for general $F(u)$,
and for the expected (also called predictive) entropy and the expected mutual
information in particular. These results are key building blocks for applying
the IDM. Walley suggests, for instance, to use
$\min_t P_t[\mathcal{F} \ge c] \ge \alpha$ for inference problems and
$\min_t E_t[\mathcal{F}] \ge c$ for decision problems [Wal96], where
$\mathcal{F}$ is some function of $\pi$. One application is the inference of
robust tree-dependency structures [Zaf01, ZH05], in which edges are partially
ordered based on Imprecise mutual information.

Section 2 gives a brief introduction to the IDM and describes our problem setup.
In Section 3 we derive exact robust intervals for concave functions $F$, such as
the entropy. Section 4 derives approximate robust intervals for arbitrary $F$. In
Section 5 we show how bounds of elementary functions can be used to get bounds
for composite functions, especially for sums and products of functions. The
results are used in Section 6 for deriving robust intervals for the mutual
information. The issue of how to set up IDM models on product spaces is
discussed in Section 7. Section 8 addresses the problem of how to combine
Bayesian credible intervals with the robust intervals of the IDM. Conclusions
are given in Section 9. Appendix A lists properties of the $\psi$ function,
which occurs in the expressions for the expected entropy and mutual
information. Appendix B contains a table of used notation.

1 Also called objective or aleatory probabilities.

2 We denote vectors by $x := (x_1,...,x_d)$ for $x \in \{n, t, u, \pi, ...\}$,
and $i$ ranges from 1 to $d$ unless otherwise stated. See also Appendix B.

3 Also called second order or subjective or belief or epistemic probabilities.

4 Strictly speaking, $\Delta$ should be the open simplex [Wal96], since $p(\pi)$
is improper for $t$ on the boundary of $\Delta$. For simplicity we assume that,
if necessary, the considered functions of $t$ can be and are continuously
extended to the boundary of $\Delta$, so that, for instance, minima and maxima
exist. All considerations can straightforwardly, but cumbersomely, be rewritten
in terms of an open simplex. Note that open/closed $\Delta$ result in
open/closed robust intervals, the difference being numerically and practically
irrelevant.

5 But see [Hut07] for a proper Bayesian reconciliation of these principles.

2 The Imprecise Dirichlet Model 

This section provides a brief introduction to the IDM, introduces notation, and
describes our generic problem setup of finding upper and lower statistical
estimators. We first introduce the multinomial process and the Bayesian
treatment with Dirichlet priors, and then the IDM extension to sets of such
priors. See [Wal96] for a more thorough account and motivation.

Random i.i.d. processes. We consider discrete random variables
$i \in \{1,...,d\}$ and an i.i.d. random process with outcome
$i \in \{1,...,d\}$ having probability $\pi_i$. The chances $\pi$ form a
probability distribution, i.e.
$\pi \in \Delta := \{x \in \mathbb{R}^d : x_i \ge 0\ \forall i,\ x_+ = 1\}$,
where we have used the abbreviations $x = (x_1,...,x_d)$ and
$x_+ := \sum_{i=1}^d x_i$. The likelihood of a specific (ordered) data set
$D = (i_1,...,i_n)$ with $n_i$ observations of $i$ and total sample size
$n = n_+ = \sum_i n_i$ is $p(D|\pi) = \prod_i \pi_i^{n_i}$. The chances $\pi_i$
are usually unknown and have to be estimated from the sample frequencies $n_i$.
The maximum likelihood (frequency) estimate $n_i/n$ for $\pi_i$ is one possible
point estimate.

The Bayesian approach. A (precise) Bayesian models the initial uncertainty in
$\pi$ by a (second order) prior "belief" distribution $p(\pi)$ with domain
$\pi \in \Delta$. The Dirichlet priors $p(\pi) \propto \prod_i \pi_i^{n'_i - 1}$,
where $n'_i$ comprises prior information, represent a large class of priors. The
$n'_i$ may be interpreted as (possibly fractional) virtual numbers of
"observations". High prior belief in $i$ can be modeled by large $n'_i$. It is
convenient to write $n'_i = s \cdot t_i$ with $s := n'_+$, hence $t \in \Delta$.
Having no initial bias, one should choose a prior in which all $t_i$ are equal,
i.e. $t_i = 1/d\ \forall i$. Examples for $s$ are $0$ for Haldane's prior
[Hal48], $1$ for Perks' prior [Per47], $d/2$ for Jeffreys' prior [Jef46], and $d$
for Bayes-Laplace's uniform prior [GCSR95]. From the prior and the data
likelihood one can determine the posterior
$p(\pi|D) = p(\pi|n) \propto \prod_i \pi_i^{n_i + s t_i - 1}$.

The posterior $p(\pi|D)$ summarizes all statistical information available in the
data. In general, the posterior is a very complex object, so we are interested
in summaries of this plethora of information. A possible summary is the
expected value or mean $E_t[\pi_i] = \frac{n_i + s t_i}{n+s}$, which is often
used for estimating $\pi_i$. The accuracy may be obtained from the covariance of
$\pi$.

Usually one is not only interested in an estimation of the whole vector $\pi$,
but also in an estimation of scalar functions
$\mathcal{F} : \Delta \to \mathbb{R}$ of $\pi$, such as the entropy
$\mathcal{H}(\pi) = -\sum_i \pi_i \log \pi_i$, where $\log$ denotes the natural
logarithm. Since $\mathcal{F}$ is itself a random variable we could determine
the posterior distribution
$p(\mathcal{F}_0|n) = \int_\Delta \delta(\mathcal{F}(\pi) - \mathcal{F}_0)\, p(\pi|n)\, d\pi$
of $\mathcal{F}$, where $\mathcal{F}_0 \in \mathbb{R}$ and $\delta()$ is the
Dirac delta distribution. This may further be summarized by the posterior mean
$E_t[\mathcal{F}] = \int_\Delta \mathcal{F}(\pi)\, p(\pi|n)\, d\pi$ and possibly
the posterior variance $Var_t[\mathcal{F}]$. A simple but crude approximation
for the mean can be obtained by exchanging $E$ with $\mathcal{F}$ (exact only
for linear functions): $E_t[\mathcal{F}(\pi)] \approx \mathcal{F}(E_t[\pi])$.
The approximation error is typically of the order $1/n$.

The Imprecise Dirichlet Model. There are several problems with this approach.
First, the uniform choice $t_i = 1/d$ depends on how events are grouped into $d$
classes, which could be ambiguous. Secondly, it assumes exact prior knowledge
of $p(\pi)$. The solution to the second problem is to model our ignorance by
considering sets of priors $p(\pi)$, often called Imprecise probabilities. The
specific Imprecise Dirichlet Model (IDM) [Wal96] considers the set of all
$t \in \Delta$, i.e. $\{p(\pi|n) : t \in \Delta\}$, which also solves the first
problem. Walley suggests fixing the hyperparameter $s$ somewhere in the interval
$[1,2]$. A set of priors results in a set of posteriors, a set of expected
values, etc. For real-valued quantities like the expected entropy
$E_t[\mathcal{H}]$ the sets are typically intervals, which we call robust
intervals

$E_t[\mathcal{F}] \in [\min_{t\in\Delta} E_t[\mathcal{F}],\ \max_{t\in\Delta} E_t[\mathcal{F}]]$.
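
As a small illustration of such a robust interval, the following sketch evaluates the interval $[\frac{n_i}{n+s}, \frac{n_i+s}{n+s}]$ of expected chances that results from letting $t_i$ range over $[0,1]$ in $E_t[\pi_i] = \frac{n_i + s t_i}{n+s}$. The function name `expected_chance_interval` is hypothetical and not part of the original text, and the sketch assumes integer (or `Fraction`) counts and hyperparameter $s$.

```python
from fractions import Fraction

def expected_chance_interval(counts, s):
    """Robust interval of E_t[pi_i] for each outcome i under the IDM.

    E_t[pi_i] = (n_i + s*t_i)/(n + s) is increasing in t_i, so the extremes
    are attained at t_i = 0 and t_i = 1 (hypothetical helper, exact arithmetic).
    """
    n = sum(counts)
    return [(Fraction(c, n + s), Fraction(c + s, n + s)) for c in counts]

# Counts n = (3, 6) with s = 1, the setting used in the later examples.
for lower, upper in expected_chance_interval([3, 6], 1):
    print(lower, upper)
```

Exact rational arithmetic is used here only to make the endpoints easy to compare with hand calculations; floating point would serve equally well.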

Problem setup and notation. Consider any statistical estimator $F$. $F$ is a
function of the data $D$ and the hyperparameters $t$. We define the general
correspondence

$u_i^{\cdots} = \frac{n_i + s\, t_i^{\cdots}}{n+s}$, where $\cdots$ can be various superscripts or be empty. (1)

$F$ can, hence, be rewritten as a function of $u$ and $D$. Since we regard $D$
as fixed, we suppress this dependence and simply write $F = F(u)$. This is
further motivated by the fact that all Bayesian estimators of functions
$\mathcal{F}$ of $\pi$ only depend on $u$ and the sample size $n+s$. It is easy
to see that this holds for the mean, i.e. $E_t[\mathcal{F}] = F(u; n+s)$, and
similarly for the variance and all higher (central) moments. Most of this work
is applicable to generic $F$, whatever its origin, as an expectation of
$\mathcal{F}$ or otherwise. The main focus of this work is to derive exact and
approximate expressions for upper and lower $F$ values

$\overline{F} := \max_{t\in\Delta} F(u)$ and $\underline{F} := \min_{t\in\Delta} F(u)$, $\overline{\underline{F}} := [\underline{F}, \overline{F}]$.

$t \in \Delta \Leftrightarrow u \in \Delta'$, where
$\Delta' := \{u : u_i \ge \frac{n_i}{n+s}\ \forall i,\ u_+ = 1\}$. We define
$u^{\overline{F}}$ as the $u \in \Delta'$ which maximizes $F$, i.e.
$\overline{F} = F(u^{\overline{F}})$, and similarly $t^{\overline{F}}$ through
relation (1). If the maximum of $F$ is assumed in a corner of $\Delta'$ we
denote the index of the corner by $i^{\overline{F}}$, i.e.
$t_i^{\overline{F}} = \delta_{i i^{\overline{F}}}$, where $\delta_{ij}$ is
Kronecker's delta function, and similarly for $u^{\underline{F}}$,
$t^{\underline{F}}$, $i^{\underline{F}}$.

3 Exact Robust Intervals for Concave Estimators 

In this section we derive exact expressions for $\overline{\underline{F}}$ if
$F : \Delta \to \mathbb{R}$ is of the form

$F(u) = \sum_{i=1}^d f(u_i)$ and concave $f : [0,1] \to \mathbb{R}$. (2)

The expected entropy is such an example (discussed later). Convex $f$ are
treated similarly (or simply take $-f$).

The nature of the solution. The approach to a solution of this problem is
motivated as follows: Due to symmetry and concavity of $F$, the global maximum
is attained at the center $u_i = 1/d$ of the probability simplex $\Delta$ if we
allow $u \in \Delta$, i.e. the more uniform $u$ is, the larger $F(u)$. The nearer
$u$ is to a vertex of $\Delta$, i.e. the more unbalanced $u$ is, the smaller
$F(u)$ is. But the constraints $t_i \ge 0$ restrict $u$ to the smaller simplex

$\Delta' = \{u : u_i \ge u_i^0\ \forall i,\ u_+ = 1\}$ with $u_i^0 := \frac{n_i}{n+s}$,

which prevents setting $u_i^{\overline{F}} = 1/d$ and
$u_i^{\underline{F}} = \delta_{i1}$. Nevertheless, the basic idea of choosing
$u$ as uniform / as unbalanced as possible still works, as we will see.

Greedy F(u) minimization. Consider the following procedure for obtaining
$u^{\underline{F}}$. We start with $t \equiv 0$ (outside the usual domain
$\Delta$ of $F$, which can be extended to $[0,1]^d$ via (2)) and then gradually
increase $t$ in an axis-parallel way until $t_+ = 1$. By axis-parallel we mean
that only one component of $t$ is increased at a time; which one may change
during the process. The total zigzag curve from $t^{start} = 0$ to $t^{end}$
has length $t_+^{end} = 1$. Since all possible curves have the same (Manhattan)
length 1, $F(u^{end})$ is minimized for the curve which has (on average) the
smallest $F$-gradient along its path. A greedy strategy is to follow the
direction $i$ of currently smallest $F$-gradient
$\partial F/\partial t_i \propto f'(u_i)$. Since $f'$ is monotone decreasing
($f'' < 0$), $\partial F/\partial t_i$ is smallest for largest $u_i$. At
$t^{start} = 0$, $u_i = \frac{n_i}{n+s}$ is largest for
$i = i^{min} := \arg\max_i n_i$. Once we start in direction $i^{min}$,
$u_{i^{min}}$ increases even further whereas all other $u_i$ ($i \ne i^{min}$)
remain constant. So the moving direction is never changed and finally we reach a
local minimum at $t_i^{end} = \delta_{i i^{min}}$. Below we show that this is a
global minimum, i.e.

$t_i^{\underline{F}} = \delta_{i i^{\underline{F}}}$ with $i^{\underline{F}} := \arg\max_i n_i$. (3)

Greedy F(u) maximization. Similarly we maximize $F(u)$. Now we increase $t$ in
direction $i = i_1$ of maximal $\partial F/\partial t_i$, which is the direction
of smallest $u_i$. Again, (only) $u_{i_1}$ increases, but possibly reaches a
value where it is no longer the smallest one. We stop if it becomes equal to the
second smallest $u_i$, say $i = i_2$. We now have to increase $u_{i_1}$ and
$u_{i_2}$ with the same speed (or in an $\varepsilon$-zigzag fashion) until they
become equal to $u_{i_3}$, etc., or until $u_+ = 1 = t_+$ is reached. Assume the
process stops with direction $i_m$ and minimal $u$ being $\tilde{u}$, i.e.
finally $u_{i_k} = \tilde{u}$ for $k \le m$ and $t_{i_k} = 0$ for $k > m$. From
the constraint
$1 = u_+ = \sum_{k\le m} u_{i_k} + \sum_{k>m} u_{i_k} = m\tilde{u} + \sum_{k>m} \frac{n_{i_k}}{n+s}$
we obtain
$\tilde{u} = \frac{1}{m}[1 - \sum_{k>m} \frac{n_{i_k}}{n+s}] = [s + \sum_{k\le m} n_{i_k}] / [m(n+s)]$.
One can show that $\tilde{u}$ as a function of $m$ has one global minimum (no
local ones) and that the final $m$ is the one which minimizes $\tilde{u}$, i.e.

$\tilde{u} = \min_{m\in\{1...d\}} \frac{s + \sum_{k\le m} n_{i_k}}{m(n+s)}$, where $n_{i_1} \le n_{i_2} \le ... \le n_{i_d}$, $u_i^{\overline{F}} = \max\{u_i^0, \tilde{u}\}$. (4)

If there is a unique minimal $n_{i_1}$ with gap $\ge s$ to the 2nd smallest
$n_{i_2}$ (which is quite likely for not too small $n$ and small $s$ like 1 or
2), then $m = 1$ and the maximum is attained at a corner of $\Delta$
($\Delta'$).
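
The two greedy constructions can be sketched in a few lines of Python. This is a minimal sketch, assuming integer counts and integer $s$ so that exact fractions can be used; the function names `u_min` and `u_max` are hypothetical. `u_min` puts all prior mass on the largest count as in (3), and `u_max` computes the level $\tilde{u}$ from (4) and raises all smaller components to it.

```python
from fractions import Fraction

def u_min(counts, s):
    """u minimizing a concave sum F(u) = sum_i f(u_i): eq. (3).

    All prior mass s goes to the outcome with the largest count
    (ties broken by the first such index).
    """
    N = sum(counts) + s
    i_min = max(range(len(counts)), key=lambda i: counts[i])
    return [Fraction(c + (s if i == i_min else 0), N)
            for i, c in enumerate(counts)]

def u_max(counts, s):
    """u maximizing a concave sum F(u) = sum_i f(u_i): eq. (4).

    The level u~ is the minimum over m of (s + sum of the m smallest
    counts) / (m * (n + s)); components below u~ are lifted to u~.
    """
    N = sum(counts) + s
    level = None
    partial = 0
    for m, c in enumerate(sorted(counts), start=1):
        partial += c
        candidate = Fraction(s + partial, m * N)
        if level is None or candidate < level:
            level = candidate
    return [max(Fraction(c, N), level) for c in counts]

print(u_min([3, 6], 1))   # components 3/10 and 7/10
print(u_max([3, 6], 1))   # components 2/5 and 3/5
```

For counts $(2,2,5)$ and $s=1$ the level is reached with $m=2$, so both small components are raised to $1/4$ while the largest stays at $1/2$, and the result still sums to one.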

Theorem 1 (Exact extrema for concave functions on simplices) Assume
$F : \Delta' \to \mathbb{R}$ is a concave function of the form
$F(u) = \sum_{i=1}^d f(u_i)$. Then $F$ attains the global maximum
$\overline{F}$ at $u^{\overline{F}}$ defined in (4) and the global minimum
$\underline{F}$ at $u^{\underline{F}}$ defined in (3).

Proof. What remains to be shown is that the solutions obtained in the last
paragraphs by greedy minimization/maximization of $F(u)$ are actually global
minima/maxima. For this assume that $t$ is a local minimum of $F(u)$. Let
$j := \arg\max_i u_i$ (ties broken arbitrarily). Assume that there is a
$k \ne j$ with non-zero $t_k$. Define $t'$ as $t'_i = t_i$ for all
$i \ne j,k$, and $t'_j = t_j + \varepsilon$, $t'_k = t_k - \varepsilon$, for
some $0 < \varepsilon \le t_k$. From $u_k \le u_j$ and the concavity of $f$ we
get^6

$F(u') - F(u) = [f(u'_j) + f(u'_k)] - [f(u_j) + f(u_k)]$
$= [f(u_j + \sigma\varepsilon) - f(u_j)] - [f(u_k) - f(u_k - \sigma\varepsilon)] < 0$,

where $\sigma := \frac{s}{n+s}$. This contradicts the minimality assumption of
$t$. Hence, $t_i = 0$ for all $i$ except one (namely $j$, where it must be 1).
(Local) minima are attained in the vertices of $\Delta$. Obviously the global
minimum is for $t_i^{\underline{F}} = \delta_{i i^{\underline{F}}}$ with
$i^{\underline{F}} := \arg\max_i n_i$. This solution coincides with the greedy
solution. Note that the global minimum may not be unique, but since we are only
interested in the value of $F(u^{\underline{F}})$ and not its argument this
degeneracy is of no further significance.

Similarly for the maximum, assume that $t$ is a (local) maximum of $F(u)$. Let
$j := \arg\min_i u_i$ (ties broken arbitrarily). Assume that there is a
$k \ne j$ with non-zero $t_k$ and $u_k > u_j$. Define $t'$ as above with
$0 < \varepsilon < \min\{t_k, t_k - t_j\}$. Concavity of $f$ implies

$F(u') - F(u) = [f(u_j + \sigma\varepsilon) - f(u_j)] - [f(u_k) - f(u_k - \sigma\varepsilon)] > 0$,

which contradicts the maximality assumption of $t$. Hence $t_i = 0$ if $u_i$ is
not minimal ($\tilde{u}$). The previous paragraph constructed the unique
solution $u^{\overline{F}}$ satisfying this condition. Since this is the only
local maximum it must be the unique global maximum (contrast this to the
minimum case). ■



Theorem 2 (Exact extrema of expected entropy) Let
$\mathcal{H}(\pi) = -\sum_i \pi_i \log \pi_i$ be the entropy of $\pi$ and the
uncertainty of $\pi$ be modeled by the Imprecise Dirichlet Model. The expected
entropy $H(u) := E_t[\mathcal{H}]$ for given hyperparameter $t$ and sample $n$
is given by

$H(u) = \sum_i h(u_i)$ with $h(u) = u \cdot [\psi(n+s+1) - \psi((n+s)u+1)] = u \sum_{k=(n+s)u+1}^{n+s} k^{-1}$ (5)

where $\psi(x) = d \log\Gamma(x)/dx$ is the logarithmic derivative of the Gamma
function and the last expression is valid for integral $s$ and $(n+s)u$. The
lower $\underline{H}$ and upper $\overline{H}$ expected entropies are assumed at
$u^{\underline{H}}$ and $u^{\overline{H}}$ given in (3) and (4) (with $F$
replaced by $H$, see also (1)).

A derivation of the exact expression (5) for the expected entropy can be found
in [WW95, Hut01]. The only thing to be shown is that $h$ is concave. This may
be done by exploiting special properties of the digamma function $\psi$ (see
[AS74, Chp.6]). There are fast implementations of $\psi$ and its derivatives
and exact expressions for integer and half-integer arguments (see Appendix A
for details).

6 The slope $\frac{f(u+\varepsilon) - f(u)}{\varepsilon}$ is a decreasing
function in $u$ for any $\varepsilon > 0$, since $f$ is concave.

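Example 3 (Exact robust expected entropy) To see how the derived formulas can
be used, let us compute the upper and lower expected entropy for
$d = 2$, $n_1 = 3$, $n_2 = 6$, i.e. $n = 9$, and $s = 1$, hence
$\sigma = \frac{1}{10}$. The general correspondence (1) becomes

$u_1 = \frac{3 + t_1}{10}$, $u_2 = \frac{6 + t_2}{10}$, hence $t^0 = (0,0)$ implies $u^0 = (\frac{3}{10}, \frac{6}{10})$.

Using $n_1 < n_2$, (3) implies

$i^{\underline{H}} = 2$, $t^{\underline{H}} = (0,1)$, hence $u^{\underline{H}} = (\frac{3}{10}, \frac{7}{10})$.

From (4), using $i_1 = 1$ and $i_2 = 2$, we get

$\tilde{u} = \min\{\frac{1+3}{1 \cdot 10}, \frac{1+3+6}{2 \cdot 10}\} = \frac{4}{10}$, $u^{\overline{H}} = \max\{u^0, \tilde{u}\} = (\frac{4}{10}, \frac{6}{10})$.

This shows that the upper bound is assumed in a/the corner
$t^{\overline{H}} = (1,0)$. Inserting these $u$ into (5), we get

$h(\frac{3}{10}) = \frac{2761}{8400}$, $h(\frac{4}{10}) = \frac{2131}{6300}$, $h(\frac{6}{10}) = \frac{1207}{4200}$, $h(\frac{7}{10}) = \frac{847}{3600}$.

Putting everything together we get the robust $H$ estimate

$\overline{\underline{H}} = [H(u^{\underline{H}}), H(u^{\overline{H}})] = [h(\frac{3}{10}) + h(\frac{7}{10}),\ h(\frac{4}{10}) + h(\frac{6}{10})]$
$= [\frac{3553}{6300}, \frac{7883}{12600}] = [0.5639, 0.6256]$.

The size of this interval is $\overline{H} - \underline{H} = 0.0616$, so it is
of the order of $\sigma$. ◊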
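
The numbers in this example can be checked with exact rational arithmetic. The sketch below uses the finite-sum form of (5), which is valid because $s$ and $(n+s)u$ are integers here; the helper name `h` mirrors the symbol in the text, while its integer-argument signature `h(m, N)` with $u = m/N$ and $N = n+s$ is an assumption made for this illustration.

```python
from fractions import Fraction

def h(m, N):
    """h(u) from eq. (5) for u = m/N with integer m = (n+s)*u and N = n+s.

    Uses the finite-sum form u * sum_{k=m+1}^{N} 1/k, valid for integer m.
    """
    return Fraction(m, N) * sum(Fraction(1, k) for k in range(m + 1, N + 1))

N = 10                              # n + s with n = 9, s = 1
lower = h(3, N) + h(7, N)           # H at u = (3/10, 7/10), eq. (3)
upper = h(4, N) + h(6, N)           # H at u = (4/10, 6/10), eq. (4)

print(lower, float(lower))          # 3553/6300, about 0.56397
print(upper, float(upper))          # 7883/12600, about 0.62563
```

For non-integer arguments the finite sum no longer applies and a digamma implementation would be needed instead.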

In general, in order to apply Theorem 1 we need to be able to (a) somehow
compute $F(u)$, e.g. compute the expectation $E_t[\mathcal{F}]$, (b) verify
whether $F(u)$ has the form $\sum_i f(u_i)$, which is often trivial, e.g. if
$\mathcal{F}(\pi) = \sum_i \phi(\pi_i)$, and (c) prove concavity or convexity of
$F$. In the following sections we derive conservative approximations for more
general $F(u)$.
4 Approximate Robust Intervals

In this section we derive approximations for $\overline{\underline{F}}$
suitable for arbitrary, twice differentiable functions $F(u)$. The derived
approximations for $\overline{\underline{F}}$ will be robust in the sense of
covering the set $\overline{\underline{F}}$ (for any $n$), and the
approximations will be "good" if $n$ is not too small. We do this by means of a
finite Taylor series expansion in $\sigma := \frac{s}{n+s}$ and by bounding the
remainder.

In the following, we treat $\sigma$ as a (small) expansion parameter. For
$u, u^* \in \Delta'$ we have

$u_i - u_i^* = \sigma \cdot (t_i - t_i^*)$ and $|u_i - u_i^*| = \sigma |t_i - t_i^*| \le \sigma$ with $\sigma := \frac{s}{n+s}$. (6)

Hence we may Taylor-expand $F(u)$ around $u^*$, which leads to a Taylor series
in $\sigma$. This shows that $F$ is approximately linear in $u$ and hence in
$t$. A linear function on a simplex assumes its extreme values at the vertices
of the simplex. This has already been encountered in Section 3. The
consideration above is a simple explanation for this fact. This also shows that
the robust interval $\overline{\underline{F}}$ is of size
$\overline{F} - \underline{F} = O(\sigma)$.^7 Any approximation to
$\overline{\underline{F}}$ should hence be at least $O(\sigma^2)$. The
expansion of $F$ to $O(\sigma)$ is

$F(u) = \underbrace{F(u^*)}_{F_0 = O(1)} + \underbrace{\sum_i [\partial_i F(\check{u})](u_i - u_i^*)}_{F_R = O(\sigma)}$, (7)

where $\partial_i F(u)$ is the partial derivative $\partial F(u)/\partial u_i$
of $F(u)$ w.r.t. $u_i$. For suitable $\check{u} = \check{u}(u, u^*) \in \Delta'$
this expansion is exact ($F_R$ is the exact remainder). Natural points for
expansion are $t_i^* = 1/d$ in the center of $\Delta$, or possibly also
$t_i^* = n_i/n = u_i^*$. Here, we expand around the improper point
$t^* := t^0 \equiv 0$, which is outside(!) $\Delta$, since this makes
expressions particularly simple.^8 Eq. (6) is still valid in this case, and
$F_R$ is exact for some $\check{u}$ in

$\Delta'_e := \{u : u_i \ge u_i^0\ \forall i,\ u_+ \le 1\}$, where $u_i^0 = \frac{n_i}{n+s}$.

Note that we keep the exact condition $u \in \Delta'$. $F$ is usually already
defined on $\Delta'_e$ or extends from $\Delta'$ to $\Delta'_e$ without effort
in a natural way (analytical continuation). We introduce the notation

$F \sqsubseteq G \ :\Leftrightarrow\ F \le G$ and $F = G + O(\sigma^2)$, (8)

stating that $G$ is a "good" upper bound on $F$. The following bounds hold for
arbitrary differentiable functions. In order for the bounds to be "good," $F$
has to be Lipschitz differentiable in the sense that there exists a constant
$c$ such that

$|\partial_i F(u)| \le c$ and $|\partial_i F(u) - \partial_i F(u')| \le c\,|u - u'|$ $\forall u, u' \in \Delta'_e$ and $\forall i \in \{1,...,d\}$. (9)

If $F$ depends also on $n$, e.g. via $\sigma$ or $u^0$, then $c$ shall be
independent of them.

The Lipschitz condition is satisfied, for instance, if the curvature
$\partial^2 F$ is uniformly bounded. This is satisfied for the expected entropy
$H$ (see (5)), but violated for the approximation
$E_t[\mathcal{H}] \approx \mathcal{H}(u)$ if $u_i^0 = 0$ for some $i$.

7 $f(n,t,s) = O(\sigma^k) :\Leftrightarrow \exists c\ \forall n \in \mathbb{N}_0^d,\ t \in \Delta,\ s > 0 : |f(n,t,s)| \le c\,\sigma^k$, where $\sigma = \frac{s}{n+s}$.

8 The order of accuracy $O(\sigma^2)$ we will encounter is the same for all
choices of $u^*$. The concrete numerical errors differ, of course. The choice
$t^* = 0$ can lead to $O(d)$ smaller $F_R$ than the natural center point
$t^* = 1/d$, but is more likely a factor $O(1)$ larger. The exact numerical
values depend on the structure of $F$.

Theorem 4 (Approximate robust intervals) Assume
$F : \Delta'_e \to \mathbb{R}$ is a Lipschitz differentiable function (9). Let
$[\underline{F}, \overline{F}]$ be the global [minimum, maximum] of $F$
restricted to $\Delta'$. Then

$F(u^1) \sqsubseteq \overline{F} \sqsubseteq F_0 + F_R^{ub}$ where $F_R^{ub} := \max_i F_{iR}^{ub}$ and $F_{iR}^{ub} := \sigma \max_{u\in\Delta'_e} [\partial_i F(u)]$,
$F_0 + F_R^{lb} \sqsubseteq \underline{F} \sqsubseteq F(u^2)$ where $F_R^{lb} := \min_i F_{iR}^{lb}$ and $F_{iR}^{lb} := \sigma \min_{u\in\Delta'_e} [\partial_i F(u)]$,

$F_0 := F(u^0)$, and $t_i^1 := \delta_{i i^1}$ with
$i^1 := \arg\max_i F_{iR}^{ub}$, and $t_i^2 := \delta_{i i^2}$ with
$i^2 := \arg\min_i F_{iR}^{lb}$, and $\sqsubseteq$ defined in (8) means $\le$
and $= +O(\sigma^2)$, where $\sigma = 1 - u_+^0$.

For conservative estimates, the lower bound on $\underline{F}$ and the upper
bound on $\overline{F}$ are the interesting ones. Together with the "inner"
bounds $F(u^1)$ and $F(u^2)$, they also yield interesting information about the
accuracy of the approximations: $F_0 + F_R^{ub} - F(u^1)$ is an upper bound on
the (unknown) approximation error $F_0 + F_R^{ub} - \overline{F}$, and
similarly for $\underline{F}$.

Proof. We start by giving an $O(\sigma^2)$ bound on
$\overline{F}_R = \max_{u\in\Delta'} F_R(u)$. We first insert (6) with
$t^* = t^0 \equiv 0$ into (7) and treat $\check{u}$ and $t$ as separate
variables:

$F_R(\check{u}, t) = \sigma \sum_i [\partial_i F(\check{u})] \cdot t_i \ \sqsubseteq\ \max_{\check{u}\in\Delta'_e} \sigma \sum_i [\partial_i F(\check{u})] \cdot t_i \ \sqsubseteq\ \sum_i F_{iR}^{ub} \cdot t_i$
with $F_{iR}^{ub} := \sigma \max_{u\in\Delta'_e} [\partial_i F(u)]$ (10)

The first inequality is obvious, the second follows from the convexity of max.
From assumption (9) we get $\partial_i F(u) - \partial_i F(u') = O(\sigma)$ for
all $u, u' \in \Delta'_e$, since $\Delta'_e$ has diameter $O(\sigma)$. Due to
one additional $\sigma$ in (10) the expressions in (10) change only by
$O(\sigma^2)$ when introducing or dropping $\max_u$ anywhere. This shows that
the inequalities are tight within $O(\sigma^2)$ and justifies $\sqsubseteq$. We
now upper bound $F_R(u)$:

$\overline{F}_R = \max_{u\in\Delta'} F_R(u) \ \sqsubseteq\ \max_{\check{u}\in\Delta'_e} \max_{t\in\Delta} F_R(\check{u}, t) \ \sqsubseteq\ \max_{t\in\Delta} \sum_i F_{iR}^{ub} \cdot t_i = \max_i F_{iR}^{ub} =: F_R^{ub}$ (11)

A linear function on $\Delta$ is maximized by setting the $t_i$ component with
largest coefficient to 1. This shows the last equality. The maximization over
$u$ in (10) can often be performed analytically, leaving an easy $O(d)$ time
task for maximizing over $i$.

We have derived an upper bound $F_R^{ub}$ on $\overline{F}_R$. Let us define the
corner $t_i^1 = \delta_{i i^1}$ of $\Delta$ with
$i^1 := \arg\max_i F_{iR}^{ub}$. Since $\overline{F}_R \ge F_R(u)$ for all $u$,
$F_R(u^1)$ in particular is a lower bound on $\overline{F}_R$. A similar line of
reasoning as above shows that $F_R(u^1) = \overline{F}_R + O(\sigma^2)$. Using
$\overline{F + const.} = \overline{F} + const.$ we get $O(\sigma^2)$ lower and
upper bounds on $\overline{F}$, i.e.
$F(u^1) \sqsubseteq \overline{F} \sqsubseteq F_0 + F_R^{ub}$. $\underline{F}$ is
bounded similarly with all max's replaced by min's and inequalities reversed.
Together this proves Theorem 4. ■

In the following sections we assume the definitions/notation of Theorem 4 for
$F$ and analogous ones for all other occurring estimators ($G, H, I, ...$).

5 Error Propagation 

We now show how bounds of elementary functions obtained by Theorem 4 can be
used to get bounds for more complex composite functions, especially for sums and
products of functions. The results are used in Section 6 for deriving robust
intervals for the mutual information, for which exact solutions are not known.

Approximation of $\overline{\underline{F}}$ (special cases). For the special
case $F(u) = \sum_i f(u_i)$ we have $\partial_i F(u) = f'(u_i)$. For concave $f$
like in the case of the entropy we get particularly simple bounds

$F_{iR}^{ub} = \sigma \max_{u\in\Delta'_e} f'(u_i) = \sigma f'(u_i^0)$, $F_R^{ub} = \sigma \max_i f'(u_i^0) = \sigma f'(\frac{\min_i n_i}{n+s})$, (12)
$F_{iR}^{lb} = \sigma \min_{u\in\Delta'_e} f'(u_i) = \sigma f'(u_i^0 + \sigma)$, $F_R^{lb} = \sigma \min_i f'(u_i^0 + \sigma) = \sigma f'(\frac{\max_i n_i + s}{n+s})$,

where we have used
$\max_{u\in\Delta'_e} f'(u_i) = \max_{u_i \in [u_i^0, u_i^0+\sigma]} f'(u_i) = f'(u_i^0)$,
and similarly for min. Analogous results hold for convex functions. In case the
maximum cannot be found exactly one is allowed to further increase $\Delta'_e$
as long as its diameter remains $O(\sigma)$. Often an increase to
$\Box' := \{u : u_i^0 \le u_i \le u_i^0 + \sigma\} \supseteq \Delta'_e \supseteq \Delta'$
makes the problem easy. Note that if we were to perform this kind of crude
enlargement on $\max_u F(u)$ directly we would lose the bounds by $O(\sigma)$.

Example 5 (Approximate robust expected entropy) Let us compare the exact
robust estimate of the expected entropy for $n_1 = 3$, $n_2 = 6$, $s = 1$ (hence
$n = 9$, and $\sigma = \frac{1}{10}$) computed in Example 3 with this
approximation: Using the expressions for $h'$ from Appendix A we get

$h'(\frac{3}{10}) = \frac{13051}{2520} - \frac{\pi^2}{2} = 0.2441$ and $h'(\frac{7}{10}) = \frac{91717}{8400} - \frac{7\pi^2}{6} = -0.5958$,

where $\pi = 3.1415...$. From (5) and (12) we get

$H_0 = H(u^0) = h(\frac{3}{10}) + h(\frac{6}{10}) = \frac{69}{112}$, $H_R^{ub} = \frac{1}{10} h'(\frac{3}{10})$, $H_R^{lb} = \frac{1}{10} h'(\frac{7}{10})$.

Together with the expressions from Example 3 we get the conservative estimate

$[H_0 + H_R^{lb},\ H_0 + H_R^{ub}] = [0.5564, 0.6404]$.
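
The conservative interval can be reproduced numerically. The following is a minimal sketch, assuming integer counts and integer $s$, so that $\psi$ differences reduce to harmonic sums and $\psi'(m+1) = \pi^2/6 - \sum_{k=1}^m k^{-2}$; the derivative $h'(u) = \psi(n+s+1) - \psi((n+s)u+1) - (n+s)u\,\psi'((n+s)u+1)$ follows from differentiating (5). The function names and the integer-argument signatures `h(m, N)`, `h_prime(m, N)` are hypothetical.

```python
import math

def harmonic(m):
    """H_m = sum_{k=1}^m 1/k, so psi(m+1) - psi(1) = H_m for integer m."""
    return sum(1.0 / k for k in range(1, m + 1))

def h(m, N):
    """h(u) of eq. (5) at u = m/N for integer m."""
    return (m / N) * (harmonic(N) - harmonic(m))

def h_prime(m, N):
    """h'(u) at u = m/N: psi(N+1) - psi(m+1) - m * psi'(m+1)."""
    trigamma = math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, m + 1))
    return harmonic(N) - harmonic(m) - m * trigamma

def conservative_entropy_interval(counts, s):
    """[H_0 + H_R^lb, H_0 + H_R^ub] using the bounds of eq. (12)."""
    N = sum(counts) + s
    sigma = s / N
    H0 = sum(h(c, N) for c in counts)           # expansion point t = 0
    ub = sigma * h_prime(min(counts), N)        # largest slope: smallest count
    lb = sigma * h_prime(max(counts) + s, N)    # smallest slope: largest u
    return H0 + lb, H0 + ub

lo, hi = conservative_entropy_interval([3, 6], 1)
print(lo, hi)   # roughly 0.55649 and 0.64049
```

Both endpoints lie outside the exact interval of Example 3, as a conservative estimate should.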
Figure 1: [Expected Entropy] The figures display the various (expected) entropy
estimates for $s = 1$: The left figure for $n_1/n = 1/3$ and $n = 1...10$. The
right figure for $n = 9$ and $n_1/n = 0...0.5$. The "intersection" $n_1 = 3$ and
$n_2 = 6$ is treated analytically in Examples 3 and 5. The green (dark gray)
area is the exact robust interval $[\underline{H}, \overline{H}]$ from Theorem
2. The yellow+green (gray) area is the conservative estimate
$[H_0 + H_R^{lb}, H_0 + H_R^{ub}]$ from Theorem 4. The area
$[H(u^2), H(u^1)]$ is not shown, since (here) it essentially coincides with
$\overline{\underline{H}}$. Some point estimates $H(\frac{n}{n})$,
$H(\frac{n+1/2}{n+1})$, and $\mathcal{H}(\frac{n}{n})$ are also shown.

The approximation accuracy

$H_0 + H_R^{ub} - \overline{H} = 0.0148$ and $\underline{H} - H_0 - H_R^{lb} = 0.0074$

is consistent with our $O(\sigma^2)$ estimation. If exact expressions are not
available we can upper bound the widening by

$H_0 + H_R^{ub} - H(u^1) = 0.0148$ and $H(u^2) - H_0 - H_R^{lb} = 0.0074$.

Since generally $u^2 = u^{\underline{H}}$ and in our example also
$u^1 = u^{\overline{H}}$, the numbers coincide.



Example 6 (Entropy: dependency on n) Figure 1 (left) shows how the size of the
(conservative) robust interval of the expected entropy $H$ varies with the
sample size $n$. We considered $s = 1$ and $d = 2$ and kept $n_1/n = 1/3$ and
$n_2/n = 2/3$ fixed (allowing for fractional $n_i$). We clearly see that the
yellow (light gray) region diminishes quickly compared to the green (dark gray)
region with increasing $n$, i.e. the approximation accuracy gets better for
larger $n$. Some point estimates $H(\frac{n}{n})$, $H(\frac{n+1/2}{n+1})$, and
$\mathcal{H}(\frac{n}{n})$ are also shown. Figure 1 (right) shows the intervals
for fixed $n = 9$, while varying $n_1/n = 0...0.5$ ($n_1/n = 0.5...1$ is
symmetric). The interval $\overline{\underline{H}}$ is shorter for more uniform
$u$, since $H$ (like $\mathcal{H}$) varies more closer to the boundary of
$\Delta$. The $[H(u^2), H(u^1)]$ region is not shown since it is identical to
$\overline{\underline{H}}$ (also in the left graph except for $n = 1$). For
$n = 9$ and $n_1/n = 1/3$ we recover the results of Examples 3 and 5 (left and
right figure).

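Error propagation. Assume we found bounds for estimators $G(u)$ and $H(u)$ and
we now want to bound the sum $F(u) := G(u) + H(u)$. In the direct approach
$\overline{F} \le \overline{G} + \overline{H}$ we may lose $O(\sigma)$. A simple
example is $G(u) = u_i$ and $H(u) = -u_i$ for which $F(u) = 0$, hence
$0 = \overline{F} \le \overline{G} + \overline{H} = u_i^0 + \sigma - u_i^0 = \sigma$,
i.e. $\overline{F} \not\sqsubseteq \overline{G} + \overline{H}$. We can exploit
the techniques of the previous section to obtain $O(\sigma^2)$ approximations.

$F_{iR}^{ub} = \sigma \max_{u\in\Delta'_e} \partial_i F(u) \ \sqsubseteq\ \sigma \max_{u\in\Delta'_e} \partial_i G(u) + \sigma \max_{u\in\Delta'_e} \partial_i H(u) = G_{iR}^{ub} + H_{iR}^{ub}$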

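Theorem 7 (Error propagation: Sum) Let $G(u)$ and $H(u)$ be Lipschitz
differentiable and $F(u) = \alpha G(u) + \beta H(u)$, $\alpha, \beta \ge 0$,
then $\overline{F} \sqsubseteq F_0 + F_R^{ub}$ and
$\underline{F} \sqsupseteq F_0 + F_R^{lb}$, where
$F_0 = \alpha G_0 + \beta H_0$, and
$F_{iR}^{ub} \sqsubseteq \alpha G_{iR}^{ub} + \beta H_{iR}^{ub}$, and
$F_{iR}^{lb} \sqsupseteq \alpha G_{iR}^{lb} + \beta H_{iR}^{lb}$.

It is important to notice that
$F_R^{ub} \not\sqsubseteq G_R^{ub} + H_R^{ub}$ (use the previous example), i.e.
$\max_i [G_{iR}^{ub} + H_{iR}^{ub}] \not\sqsubseteq \max_i G_{iR}^{ub} + \max_i H_{iR}^{ub}$.
$\max_i$ cannot be pulled in, and it is important to propagate $F_{iR}^{ub}$
rather than $F_R^{ub}$.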
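
The difference between propagating the per-component bounds and propagating the already maximized bounds can be seen numerically on the example $G(u) = u_1$, $H(u) = -u_1$ with $d = 2$. The sketch below is illustrative only; the helper name `propagate_sum` and the concrete value $\sigma = 0.1$ are assumptions.

```python
def propagate_sum(G_iR, H_iR, alpha=1.0, beta=1.0):
    """Combine per-component bounds first, then maximize over i (Theorem 7)."""
    return max(alpha * g + beta * h for g, h in zip(G_iR, H_iR))

sigma = 0.1
# G(u) = u_1 has derivative (1, 0); H(u) = -u_1 has derivative (-1, 0).
G_ub = [sigma * 1.0, sigma * 0.0]
H_ub = [sigma * -1.0, sigma * 0.0]

good = propagate_sum(G_ub, H_ub)      # bound for F = G + H = 0
crude = max(G_ub) + max(H_ub)         # maximizing each term separately

print(good)    # 0.0, matches the true remainder of F = 0
print(crude)   # 0.1, an O(sigma) overestimate
```

The per-component version recovers the exact value because the cancellation between $G$ and $H$ happens inside the same component $i$ before the maximum is taken.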

Every function F with bounded curvature can be written as a sum of a concave 
function G and a convex function H . For convex and concave functions, determining 
bounds is particularly easy, as we have seen. Often F decomposes naturally into 
convex and concave parts as is the case for the mutual information, addressed later. 
Bounds can also be derived for products. 

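Theorem 8 (Error propagation: Product) Let
$G, H : \Delta'_e \to [0,\infty)$ be non-negative Lipschitz differentiable
functions (9) with non-negative derivatives
$\partial_i G, \partial_i H \ge 0\ \forall i$ and $F(u) = G(u) \cdot H(u)$, then
$\overline{F} \sqsubseteq F_0 + F_R^{ub}$, where $F_0 = G_0 \cdot H_0$, and
$F_{iR}^{ub} \sqsubseteq G_{iR}^{ub}(H_0 + H_R^{ub}) + (G_0 + G_R^{ub}) H_{iR}^{ub}$,
and similarly for $\underline{F}$.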

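Proof. We have

$F_{iR}^{ub} = \sigma \max \partial_i F = \sigma \max \partial_i (G \cdot H) = \sigma \max [(\partial_i G) H + G (\partial_i H)] \ \sqsubseteq$
$\sigma (\max \partial_i G)(\max H) + \sigma (\max G)(\max \partial_i H) \ \sqsubseteq\ G_{iR}^{ub}(H_0 + H_R^{ub}) + (G_0 + G_R^{ub}) H_{iR}^{ub}$

where all functions depend on $u$ and all max are over $u \in \Delta'_e$. There
is one subtlety in the last inequality:
$\max G \ne \overline{G} \sqsubseteq G_0 + G_R^{ub}$. The reason for the $\ne$
is that the maximization is taken over $\Delta'_e$, not over $\Delta'$ as in the
definition of $\overline{G}$. The correct line of reasoning is as follows:

$\max_{u\in\Delta'_e} G_R(u) \ \sqsubseteq\ \max_{t \ge 0,\ t_+ \le 1} \sum_i G_{iR}^{ub} \cdot t_i = \max\{0, \max_i G_{iR}^{ub}\} = G_R^{ub} \ \Rightarrow\ \max G \sqsubseteq G_0 + G_R^{ub}$

The first inequality can be proven in the same way as (11). In the first
equality we set the $t_i = 1$ with maximal $G_{iR}^{ub}$ if it is positive. If
all $G_{iR}^{ub}$ are negative we set $t \equiv 0$.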



We assumed $G \ge 0$ and $\partial_i G \ge 0$, which implies $G_R \ge 0$. So,
since $G_R \ge 0$ anyway, this subtlety is ineffective. Similarly for
$\max H_R$. ■

It is possible to remove the rather strong non-negativity assumptions. Propaga- 
tion of errors for other combinations like ratios F = G/H may also be obtained. 



6 Robust Intervals for Expected Mutual Information

We illustrate the application of the previous results on the Mutual Information
between two random variables $i \in \{1,...,d_1\}$ and $j \in \{1,...,d_2\}$.

Mutual Information. Consider an i.i.d. random process with outcome
$(i,j) \in \{1,...,d_1\} \times \{1,...,d_2\}$ having joint probability
$\pi_{ij}$, where
$\pi \in \Delta := \{x \in \mathbb{R}^{d_1 \times d_2} : x_{ij} \ge 0\ \forall ij,\ x_{++} = 1\}$.
An important measure of the stochastic dependence of $i$ and $j$ is the mutual
information

$\mathcal{I}(\pi) = \sum_i \sum_j \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+}\pi_{+j}} = \sum_{ij} \pi_{ij} \log \pi_{ij} - \sum_i \pi_{i+} \log \pi_{i+} - \sum_j \pi_{+j} \log \pi_{+j}$
$= \mathcal{H}(\pi_{i+}) + \mathcal{H}(\pi_{+j}) - \mathcal{H}(\pi_{ij})$, (13)

where $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$ are row and
column marginal chances. Again, we assume a Dirichlet prior over $\pi_{ij}$,
which leads to a Dirichlet posterior
$p(\pi_{ij}|n) \propto \prod_{ij} \pi_{ij}^{n_{ij} + s t_{ij} - 1}$ with
$t \in \Delta$. The expected value of $\pi_{ij}$ is

$E_t[\pi_{ij}] = \frac{n_{ij} + s t_{ij}}{n+s} =: u_{ij}$.

The marginals $\pi_{i+}$ and $\pi_{+j}$ are also Dirichlet with expectation
$u_{i+}$ and $u_{+j}$. The expected mutual information
$I(u) := E_t[\mathcal{I}]$ can, hence, be expressed in terms of the expectations
of three entropies $H(u) := E_t[\mathcal{H}]$ (see (5))

$I(u) = H(u_{i+}) + H(u_{+j}) - H(u_{ij}) = H_{row} + H_{col} - H_{joint}$
$= \sum_i h(u_{i+}) + \sum_j h(u_{+j}) - \sum_{ij} h(u_{ij})$,

where here and in the following we index quantities with joint, row, and col to
denote to which distribution the quantity refers.
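
For a given prior vector $t$, the expression for $I(u)$ can be evaluated directly once $\psi$ is available. The sketch below is one possible way to do this and not the implementation referred to in the text: the `digamma` routine (upward recurrence followed by a standard asymptotic series) is a simple stand-in chosen to keep the example dependency-free, and all function names are hypothetical.

```python
import math

def digamma(x):
    """psi(x) for x > 0 via psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:                 # shift the argument up for accuracy
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def expected_entropy(u, N):
    """H(u) = sum_i u_i * [psi(N+1) - psi(N*u_i + 1)], eq. (5), N = n + s."""
    return sum(ui * (digamma(N + 1) - digamma(N * ui + 1)) for ui in u)

def expected_mutual_information(counts, s, t):
    """I(u) = H_row + H_col - H_joint for u_ij = (n_ij + s*t_ij)/(n + s)."""
    N = sum(map(sum, counts)) + s
    u = [[(c + s * tij) / N for c, tij in zip(crow, trow)]
         for crow, trow in zip(counts, t)]
    rows = [sum(r) for r in u]
    cols = [sum(c) for c in zip(*u)]
    joint = [x for r in u for x in r]
    return (expected_entropy(rows, N) + expected_entropy(cols, N)
            - expected_entropy(joint, N))

counts = [[3, 1], [2, 4]]                 # illustrative 2x2 table
uniform_t = [[0.25, 0.25], [0.25, 0.25]]
print(expected_mutual_information(counts, 1, uniform_t))
```

Evaluating this function over a grid of $t \in \Delta$ gives a brute-force check of any bounds on the robust interval, at a cost that grows quickly with $d_1 d_2$.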

Crude bounds for I(u). Estimates for the robust IDM interval
$[\min_{t\in\Delta} E_t[\mathcal{I}],\ \max_{t\in\Delta} E_t[\mathcal{I}]]$ can
be obtained by [minimizing, maximizing] $I(u)$. A crude upper bound can be
obtained as

$\overline{I} := \max_{t\in\Delta} I(u) = \max [H_{row} + H_{col} - H_{joint}] \le \max H_{row} + \max H_{col} - \min H_{joint} = \overline{H}_{row} + \overline{H}_{col} - \underline{H}_{joint}$,

where exact solutions to $\overline{H}_{row}$, $\overline{H}_{col}$ and
$\underline{H}_{joint}$ are available from Section 3. Similarly
$\underline{I} \ge \underline{H}_{row} + \underline{H}_{col} - \overline{H}_{joint}$.
The problem with these bounds is that, although good in some cases, they can
become arbitrarily crude. The following $O(\sigma^2)$ bound can be derived by
exploiting the error sum propagation Theorem 7.

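Theorem 9 (Bound on lower and upper expected Mutual Information)
The following bounds on the expected mutual information
$I(u) = E_t[\mathcal{I}]$ are valid:

$I(u^1) \sqsubseteq \overline{I} \sqsubseteq I_0 + I_R^{ub}$ and $I_0 + I_R^{lb} \sqsubseteq \underline{I} \sqsubseteq I(u^2)$, where

$I_0 = I(u^0) = H_{0row} + H_{0col} - H_{0joint} = \sum_i h(u_{i+}^0) + \sum_j h(u_{+j}^0) - \sum_{ij} h(u_{ij}^0)$,

$I_{ijR}^{ub} \sqsubseteq H_{iRrow}^{ub} + H_{jRcol}^{ub} - H_{ijRjoint}^{lb} = \sigma h'(u_{i+}^0) + \sigma h'(u_{+j}^0) - \sigma h'(u_{ij}^0 + \sigma)$,

$I_{ijR}^{lb} \sqsupseteq H_{iRrow}^{lb} + H_{jRcol}^{lb} - H_{ijRjoint}^{ub} = \sigma h'(u_{i+}^0 + \sigma) + \sigma h'(u_{+j}^0 + \sigma) - \sigma h'(u_{ij}^0)$,

with $h$ defined in (5), and $t_{ij}^0 = 0$, and
$t_{ij}^1 = \delta_{(ij)(ij)^1}$ with $(ij)^1 = \arg\max_{ij} I_{ijR}^{ub}$, and
$t_{ij}^2 = \delta_{(ij)(ij)^2}$ with $(ij)^2 = \arg\min_{ij} I_{ijR}^{lb}$, and
$I_R^{ub} = \max_{ij} I_{ijR}^{ub}$, and $I_R^{lb} = \min_{ij} I_{ijR}^{lb}$.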
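
A minimal numerical sketch of these bounds follows, assuming integer counts and integer $s$ so that every argument $(n+s)u$ is an integer and $\psi$, $\psi'$ reduce to finite sums. The per-cell bounds carry the prefactor $\sigma$ in line with (12). The function names and the example table are hypothetical; the final loop compares the bounds with $I(u)$ evaluated at the corners of $\Delta$.

```python
import math

def harmonic(m):
    return sum(1.0 / k for k in range(1, m + 1))

def h(m, N):
    """h(u) of eq. (5) at u = m/N, integer m."""
    return (m / N) * (harmonic(N) - harmonic(m))

def h_prime(m, N):
    """h'(u) at u = m/N: psi(N+1) - psi(m+1) - m*psi'(m+1), integer m."""
    trigamma = math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, m + 1))
    return harmonic(N) - harmonic(m) - m * trigamma

def expected_mi_int(cells, N):
    """I(u) for u_ij = cells[i][j]/N with integer cells."""
    rows = [sum(r) for r in cells]
    cols = [sum(c) for c in zip(*cells)]
    return (sum(h(m, N) for m in rows) + sum(h(m, N) for m in cols)
            - sum(h(m, N) for r in cells for m in r))

def mi_interval_bounds(counts, s):
    """Conservative [I_0 + I_R^lb, I_0 + I_R^ub] in the spirit of Theorem 9."""
    N = sum(map(sum, counts)) + s
    sigma = s / N
    rows = [sum(r) for r in counts]
    cols = [sum(c) for c in zip(*counts)]
    I0 = expected_mi_int(counts, N)               # expansion point t = 0
    cells = [(i, j) for i in range(len(rows)) for j in range(len(cols))]
    ub = max(sigma * (h_prime(rows[i], N) + h_prime(cols[j], N)
                      - h_prime(counts[i][j] + s, N)) for i, j in cells)
    lb = min(sigma * (h_prime(rows[i] + s, N) + h_prime(cols[j] + s, N)
                      - h_prime(counts[i][j], N)) for i, j in cells)
    return I0 + lb, I0 + ub

counts = [[3, 1], [2, 4]]
lo, hi = mi_interval_bounds(counts, 1)
print(lo, hi)
# I(u) at each corner t = delta must fall inside the conservative interval.
for a in range(2):
    for b in range(2):
        corner = [row[:] for row in counts]
        corner[a][b] += 1
        print(a, b, expected_mi_int(corner, 11))
```

Restricting to integer arguments keeps the sketch free of special-function libraries; fractional $s$ or counts would require general digamma and trigamma routines.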

7 The IDM for Product Spaces 

In the last section we considered the "full" IDM on the product of two random variables. The structure of the problem suggests considering a smaller "product" of IDMs as described below, which can lead to better estimates.

Product spaces $\Omega = \Omega_1 \times ... \times \Omega_m$ with $\Omega_k = \{1,...,d_k\}$ occur frequently in practical problems, e.g. in the mutual information ($m = 2$), in robust trees ($m = 3$), or in Bayesian nets in general ($m$ large). Without loss of generality we only discuss the $m = 2$ case in the following. Ignoring the underlying structure in $\Omega$, a Dirichlet prior in case of unknown chances $\pi_{ij}$ and an IDM as used in Section 6 with

$$t \in \Delta := \{t \in \mathbb{R}^{d_1 \times d_2} \equiv \mathbb{R}^{d_1} \otimes \mathbb{R}^{d_2} : t_{ij} \ge 0\ \forall ij,\ t_{++} = 1\} \qquad (14)$$

seems natural.

On the other hand, if we take into account the structure of $\Omega$ and go back to the original motivation of the IDM, this choice is far less obvious. Recall that one of the major motivations of the IDM was its representation invariance in the sense that inferences are not affected when grouping or splitting events in $\Omega$. For unstructured spaces like $\Omega_k$ this is a reasonable principle. For illustration, let us consider objects of various shape and color, i.e. $\Omega = \Omega_1 \times \Omega_2$, $\Omega_1$ = {ball, pen, die, ...}, $\Omega_2$ = {yellow, red, green, ...} in generalization to Walley's bag of marbles example. Assume we want to detect a potential dependency between shape and color by means of their mutual information $\mathcal{I}$. If we have no prior idea on the possible kind of colors, a model which is independent of the choice of $\Omega_2$ is welcome. Grouping red and green, for instance, corresponds to grouping $(x_{i1}, x_{i2}, x_{i3}, x_{i4}, ...)$ to $(x_{i1}, x_{i2} + x_{i3}, x_{i4}, ...)$ for all shapes $i$, where $x \in \{n, \pi, t, u\}$. Similarly for the different shapes, for instance we could group all round or all angular objects. The "smallest IDM" which respects this invariance is the one which considers all

$$t \in \Delta^{\otimes} := \Delta_{d_1} \otimes \Delta_{d_2} \subsetneq \Delta. \qquad (15)$$

The tensor or outer product $\otimes$ is defined as $(v \otimes w)_{ij} := v_i w_j$ and $V \otimes W := \{v \otimes w : v \in V, w \in W\}$. It is a bilinear (not linear!) mapping. This smaller product IDM $\Delta^{\otimes} = \Delta_{d_1} \otimes \Delta_{d_2}$ is invariant under arbitrary grouping of columns and rows of the chance matrix $(\pi_{ij})_{1 \le i \le d_1, 1 \le j \le d_2}$. In contrast to the larger full IDM $\Delta$ it is not invariant under arbitrary grouping of matrix cells, but there is anyway little motivation for the necessity of such a general invariance. General non-column/row cross groupings would destroy the product structure of $\Omega$ and with that the mere concepts of shape and color, and their correlation. For $m > 2$ as in Bayes-nets cross groupings look even less natural. Whether $\Delta^{\otimes}$ or the larger simplex $\Delta$ is the more appropriate IDM depends on whether one regards the structure $\Omega_1 \times \Omega_2$ of $\Omega$ as a natural prior knowledge or as an arbitrary a posteriori choice. The smaller IDM has the potential advantage of leading to more precise predictions (smaller robust sets).
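The product structure and its grouping invariance can be checked on small examples. The sketch below is illustrative only; the helper names `outer`, `in_product_idm` and `group_columns` are hypothetical. It tests whether a bias matrix equals the outer product of its own marginals, which is the membership condition for the product IDM once the entries are non-negative and sum to 1.

```python
def outer(v, w):
    """Tensor (outer) product: (v ⊗ w)_ij = v_i * w_j."""
    return [[vi * wj for wj in w] for vi in v]

def in_product_idm(t, tol=1e-12):
    """True if t is in the simplex and equals (row marginal) ⊗ (column marginal)."""
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    if any(x < -tol for r in t for x in r) or abs(sum(rows) - 1.0) > tol:
        return False
    return all(abs(t[i][j] - rows[i] * cols[j]) <= tol
               for i in range(len(rows)) for j in range(len(cols)))

def group_columns(t, j, k):
    """Merge column k into column j (j != k), e.g. grouping 'red' and 'green'."""
    merged = []
    for row in t:
        new = [x for idx, x in enumerate(row) if idx != k]
        new[j if j < k else j - 1] += row[k]
        merged.append(new)
    return merged

vertex = outer([1.0, 0.0], [0.0, 1.0, 0.0])      # a vertex of the full simplex
mixture = [[0.5, 0.0], [0.0, 0.5]]               # midpoint of two vertices
product_t = outer([0.3, 0.7], [0.2, 0.5, 0.3])
grouped = group_columns(product_t, 1, 2)
```

Here `vertex` and `grouped` pass the membership test, while `mixture`, a convex combination of two vertices, does not. This reflects that the product set contains the vertices of the full simplex but is not convex.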

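Let us consider an estimator $F: \Delta \to \mathbb{R}$ and its restriction $F^{\otimes}: \Delta^{\otimes} \to \mathbb{R}$ to the product IDM $\Delta^{\otimes}$. Robust intervals $[\underline{F}, \overline{F}]$ for $\Delta$ are generally wider than robust intervals $[\underline{F}^{\otimes}, \overline{F}^{\otimes}]$ for $\Delta^{\otimes}$. Fortunately not much. Although $\Delta^{\otimes}$ is a lower-dimensional subspace of $\Delta$, it contains all vertices of $\Delta$. This is possible since $\Delta^{\otimes}$ is a nonlinear subspace. The set of "vertices" in both cases is $\{t : t_{ij} = \delta_{i i_0}\delta_{j j_0},\ i_0 \in \Omega_1,\ j_0 \in \Omega_2\}$. Hence, if the robust interval boundaries $\overline{\underline{F}}$ are assumed in the vertices of $\Delta$ then the interval for the $\Delta^{\otimes}$ IDM model is the same ($\overline{\underline{F}} = \overline{\underline{F}}^{\otimes}$). Since the condition is "approximately" true, the conclusion is "approximately" true. More precisely: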

Theorem 10 (IDM bounds for product spaces) The $O(\sigma^2)$ bounds of Theorem 7 on the robust interval $\overline{\underline{F}}$ in the full IDM $\Delta$ (14) remain valid for $\overline{\underline{F}}^{\otimes}$ in the product IDM $\Delta^{\otimes}$ (15).



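Proof.

$$F(u^1) \le \overline{F}^{\otimes} \le \overline{F} \le F_0 + \overline{F}_R^{ub} = F(u^1) + O(\sigma^2),$$

where $\overline{F}^{\otimes} := \max_{t \in \Delta^{\otimes}} F(u)$ and $u^1$ was the "$\overline{F}_R$ maximizing" vertex as defined in the theorem providing the $O(\sigma^2)$ bounds ($F(u^1) \sqsubseteq \overline{F}$). The first inequality follows from the fact that all vertices of the full simplex $\Delta$ also belong to the product set $\Delta^{\otimes}$, i.e. $t^1 \in \Delta^{\otimes}$. The second inequality follows from $\Delta^{\otimes} \subset \Delta$. The remaining (in)equalities follow from the same theorem. This shows that $|\overline{F}^{\otimes} - \overline{F}| = O(\sigma^2)$, hence $F_0 + \overline{F}_R^{ub}$ is also an $O(\sigma^2)$ upper bound to $\overline{F}^{\otimes}$. This implies that to the approximation accuracy we can achieve, the choice between $\Delta$ and $\Delta^{\otimes}$ is irrelevant.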
8 Robust Credible Intervals



So far we have considered robust intervals of expected values $F = E_t[\mathcal{F}]$. We now briefly consider the problem of how to combine Bayesian credible intervals for $\mathcal{F}$ with robust intervals of the IDM.

Bayesian credible sets/intervals. For a probability density $p: \mathbb{R}^d \to [0,\infty)$, an $\alpha$-credible region is a measurable set $A$ for which $p(A) := \int p(x) 1_A(x)\, d^dx \ge \alpha$, where $1_A(x) = 1$ if $x \in A$ and $0$ otherwise, i.e. $x \in A$ with probability at least $\alpha$. For given $\alpha$, there are many choices for $A$. Often one is interested in "small" sets, where the size of $A$ may be measured by its volume $\mathrm{Vol}(A) := \int 1_A(x)\, d^dx$. Let us define a/the smallest $\alpha$-credible set

$$A^{min} := \arg\min_{A : p(A) \ge \alpha} \mathrm{Vol}(A)$$

with ties broken arbitrarily. For unimodal $p$, $A^{min}$ can be chosen as a connected set. For $d = 1$ this means that $A^{min} = [a,b]$ with $\int_a^b p(x)\,dx = \alpha$ is a minimal length highest density $\alpha$-credible interval. If, additionally, $p$ is symmetric around $E[x]$, then $A^{min} = [E[x] - c, E[x] + c]$ is also symmetric around $E[x]$.

Robust credible sets. If we have a set of probability distributions $\{p_t(x), t \in T\}$, we can choose for each $t$ an $\alpha$-credible set $A_t$ with $p_t(A_t) \ge \alpha$, a minimal one being $A_t^{min} := \arg\min_{A : p_t(A) \ge \alpha} \mathrm{Vol}(A)$. A robust $\alpha$-credible set is a set $A$ which contains $x$ with $p_t$-probability at least $\alpha$ for all $t$. A minimal size robust $\alpha$-credible set is

$$A^{min} := \arg\min_{A = \bigcup_t A_t \,:\, p_t(A_t) \ge \alpha\ \forall t} \mathrm{Vol}(A). \qquad (16)$$

It is not easy to deal with this expression, since $A^{min}$ is not a function of $\{A_t^{min} : t \in T\}$, and especially does not coincide with $\bigcup_t A_t^{min}$ as one might expect.

Robust credible intervals. This can most easily be seen for univariate symmetric unimodal distributions, where $t$ is a translation, e.g. $p_t(x) = \mathrm{Normal}(E_t[x] = t, \sigma = 1)$ with 95% credible intervals $A_t^{min} = [t-2, t+2]$. For, e.g., $T = [-1,1]$ we get $\bigcup_t A_t^{min} = [-3,3]$. The credible intervals move with $t$. One can get a smaller union if we take the intervals $A_t^{sym} = [-q_t, q_t]$ symmetric around 0. Since $A_t^{sym}$ is a non-central interval w.r.t. $p_t$ for $t \ne 0$, we have $q_t > 2$, i.e. $A_t^{sym}$ is larger than $A_t^{min}$, but one can show that the increase of $q_t$ is smaller than the shift of $A_t^{min}$ by $t$, hence we save something in the union. The optimal choice is neither $A_t^{sym}$ nor $A_t^{min}$, but something in-between.
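The saving from symmetric intervals in the Gaussian translation example can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the target coverage is taken as the exact coverage of $[t-2, t+2]$ (about 95.4%) rather than exactly 95%, and the worst case over $T = [-1,1]$ is assumed to occur at $|t| = 1$, which holds because the coverage of a fixed symmetric interval decreases with $|t|$.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage(q, t):
    """Probability of [-q, q] under Normal(mean=t, sd=1)."""
    return normal_cdf(q - t) - normal_cdf(-q - t)

def robust_symmetric_half_width(alpha, t_max, tol=1e-10):
    """Smallest q with coverage(q, t) >= alpha for all |t| <= t_max."""
    lo, hi = 0.0, t_max + 10.0
    while hi - lo > tol:                 # bisection on the worst case t = t_max
        mid = 0.5 * (lo + hi)
        if coverage(mid, t_max) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi

alpha = normal_cdf(2.0) - normal_cdf(-2.0)   # coverage of [t - 2, t + 2]
q = robust_symmetric_half_width(alpha, t_max=1.0)
```

The resulting half-width `q` lies strictly between 2 and 3, so the union of the symmetric intervals is shorter than $[-3,3]$ while each $p_t$ still assigns it probability at least `alpha`.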

To illustrate this point numerically consider triangular distributions instead of Gaussians:

$$p_t(x) := \max\{0, 1-|x-t|\}, \quad t \in T := [-\gamma, \gamma], \quad \gamma > 0,$$

$$p_t([a,b]) = \left|b^*(1-\tfrac{1}{2}|b^*|) - a^*(1-\tfrac{1}{2}|a^*|)\right| \quad\text{with}\quad a^* = \min\{\max\{a-t,-1\},1\}, \quad b^* = \min\{\max\{b-t,-1\},1\}.$$

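One can derive the following expressions for the $\alpha$-credible intervals, valid for (the interesting case of) $\alpha \ge \frac{1}{2}$:

$$A_t^{min} = [t - 1 + \sqrt{1-\alpha},\ t + 1 - \sqrt{1-\alpha}],$$

$$\bigcup_t A_t^{min} = [-\gamma - 1 + \sqrt{1-\alpha},\ \gamma + 1 - \sqrt{1-\alpha}],$$

$$A^{min} = \begin{cases} [-1 + \sqrt{1-\alpha-\gamma^2},\ 1 - \sqrt{1-\alpha-\gamma^2}] & \text{for } \gamma^2 \le \frac{1}{2}(1-\alpha), \\ [-\gamma - 1 + \sqrt{2(1-\alpha)},\ \gamma + 1 - \sqrt{2(1-\alpha)}] & \text{for } \gamma^2 \ge \frac{1}{2}(1-\alpha). \end{cases}$$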



It is easy to see that $A^{min} \subset \bigcup_t A_t^{min}$ and that $A^{min}$ is a proper subinterval of $\bigcup_t A_t^{min}$ of shorter length for every $\gamma > 0$ and $\frac{1}{2} < \alpha < 1$.
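The closed-form intervals for the triangular family can be verified against the interval probability $p_t([a,b])$. This is a minimal sketch with hypothetical function names; it assumes the triangular density $p_t(x) = \max\{0, 1-|x-t|\}$ supported on $[t-1, t+1]$ and checks that the symmetric interval $[-q, q]$ has coverage exactly $\alpha$ at the worst case $t = \gamma$.

```python
import math

def tri_mass(a, b, t):
    """P_t([a, b]) for the triangular density p_t(x) = max(0, 1 - |x - t|)."""
    def antiderivative(y):               # integral of max(0, 1 - |y|), shifted to 0 at y = 0
        return y * (1.0 - 0.5 * abs(y))
    def clip(x):                         # restrict x - t to the support [-1, 1]
        return min(max(x - t, -1.0), 1.0)
    return antiderivative(clip(b)) - antiderivative(clip(a))

def half_width_union(alpha, gamma):
    """Half-width of the union of the shortest intervals A_t^min, |t| <= gamma."""
    return gamma + 1.0 - math.sqrt(1.0 - alpha)

def half_width_robust(alpha, gamma):
    """Half-width q of the robust interval [-q, q] (closed form, alpha >= 1/2)."""
    if gamma ** 2 <= 0.5 * (1.0 - alpha):
        return 1.0 - math.sqrt(1.0 - alpha - gamma ** 2)
    return gamma + 1.0 - math.sqrt(2.0 * (1.0 - alpha))

cases = [(0.9, 0.2), (0.9, 0.5), (0.6, 1.5)]     # (alpha, gamma) pairs
results = [(half_width_robust(a, g), half_width_union(a, g)) for a, g in cases]
```

For each pair the first entry of `results` is smaller than the second, matching the statement that the robust interval is a proper subinterval of the union.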

An interesting open question is under which general conditions we can expect $A^{min} \subset \bigcup_t A_t^{min}$. In any case, $\bigcup_t A_t$ can be used as a conservative estimate for a robust credible set, since $p_t(\bigcup_{t'} A_{t'}) \ge p_t(A_t) \ge \alpha$ for all $t$.

A special (but important) case which falls outside the above framework are one-sided credible intervals, where only $A_t$ of the form $[a, \infty)$ are considered. In this case $A^{min} = \bigcup_t A_t^{min}$, i.e. $A^{min} = [a^{min}, \infty)$ with $a^{min} = \max\{a : p_t([a,\infty)) \ge \alpha\ \forall t\}$.

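Approximations. For complex distributions like for the mutual information we have to approximate (16) somehow. We use the following notation for shortest $\alpha$-credible intervals w.r.t. a univariate distribution $p_t(x)$:

$$\tilde{x}_t \equiv [\underline{\tilde{x}}_t, \overline{\tilde{x}}_t] \equiv [E_t[x] - \underline{\Delta x}_t,\ E_t[x] + \overline{\Delta x}_t] := \arg\min_{[a,b] : p_t([a,b]) \ge \alpha} (b - a),$$

where $\overline{\Delta x}_t := \overline{\tilde{x}}_t - E_t[x]$ ($\underline{\Delta x}_t := E_t[x] - \underline{\tilde{x}}_t$) is the distance from the right boundary $\overline{\tilde{x}}_t$ (left boundary $\underline{\tilde{x}}_t$) of the shortest $\alpha$-credible interval $\tilde{x}_t$ to the mean $E_t[x]$ of distribution $p_t$. We can use $\tilde{x} \equiv [\underline{\tilde{x}}, \overline{\tilde{x}}] := \bigcup_t \tilde{x}_t$ as a (conservative, but not shortest) robust credible interval, since $p_t(\tilde{x}) \ge p_t(\tilde{x}_t) \ge \alpha$ for all $t$. We can upper bound $\overline{\tilde{x}}$ (and similarly lower bound $\underline{\tilde{x}}$) by

$$\overline{\tilde{x}} = \max_t (E_t[x] + \overline{\Delta x}_t) \le \max_t E_t[x] + \max_t \overline{\Delta x}_t = \overline{E}[x] + \overline{\Delta x}. \qquad (17)$$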

We have already intensively discussed how to compute upper and lower quantities, particularly for the upper mean $\overline{E}[x]$ for $x \in \{\mathcal{F}, \mathcal{H}, \mathcal{I}, ...\}$, but the linearization technique introduced earlier is general enough to deal with all quantities that are differentiable in $t$, including $\overline{\Delta x}_t$. For example, for Gaussian $p_t$ with variances $\sigma_t^2$ we have $\overline{\Delta x}_t = \kappa\sigma_t$ with $\kappa$ given by $\alpha = \mathrm{erf}(\kappa/\sqrt{2})$, where erf is the error function (e.g. $\kappa \approx 2$ for $\alpha = 95\%$). We only need to estimate $\max_t \sigma_t$.
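The relation $\alpha = \mathrm{erf}(\kappa/\sqrt{2})$ has no closed-form inverse in the Python standard library, but it is easily solved by bisection. The sketch below uses hypothetical function names and assumes Gaussian $p_t$; `robust_upper_bound` simply evaluates the right-hand side of (17) from a given upper mean and a list of candidate standard deviations.

```python
import math

def kappa(alpha, tol=1e-12):
    """Solve alpha = erf(kappa / sqrt(2)) for kappa >= 0 by bisection."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def robust_upper_bound(upper_mean, sigmas, alpha=0.95):
    """Right-hand side of (17): max_t E_t[x] + kappa * max_t sigma_t."""
    return upper_mean + kappa(alpha) * max(sigmas)

k95 = kappa(0.95)
bound = robust_upper_bound(1.0, [0.1, 0.2])
```

For $\alpha = 95\%$ this gives $\kappa$ slightly below 2, consistent with the rounded value quoted in the text.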

For non-Gaussian distributions, exact expressions for $\overline{\Delta x}_t$ are often hard or impossible to obtain and to deal with. Non-Gaussian distributions depending on some sample size $n$ are usually close to Gaussian for large $n$ due to the central limit theorem. One may simply use $\kappa\sigma_t$ in place of $\overline{\Delta x}_t$ also in this case, keeping in mind that this could be a non-conservative approximation. More systematically, simple (and for large $n$ good) upper bounds on $\overline{\Delta x}_t$ can often be obtained and should preferably be used.

Further, we have seen that the variation of sample-dependent differentiable functions (like $E_t[x] = E_t[x|n]$) w.r.t. $t \in \Delta$ is of order $n^{-1}$. Since in such cases the standard deviation $\sigma_t \sim n^{-1/2} \sim \overline{\Delta x}_t$ is itself suppressed, the variation of $\overline{\Delta x}_t$ with $t$ is of order $n^{-3/2}$. If we regard this as negligibly small, we may simply fix some $t^* \in \Delta$:

$$\max_t \overline{\Delta x}_t = \kappa\sigma_{t^*} + O(n^{-3/2}).$$

Since $\overline{\Delta x}_t$ is "nearly" constant, this also shows that we lose at most $O(n^{-3/2})$ precision in the bound (17) (equality holds for $\overline{\Delta x}_t$ independent of $t$).

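Robust credible intervals for mutual information. Consider the mutual information $\mathcal{I}$ defined in (13). The robust credible interval for $\mathcal{I}$ can be estimated as follows.

$$\overline{\tilde{\mathcal{I}}} \le \overline{I} + \overline{\Delta\mathcal{I}} \le I_0 + \overline{I}_R^{ub} + \overline{\Delta\mathcal{I}} = I_0 + \overline{I}_R^{ub} + \kappa\sqrt{\mathrm{Var}_{t^*}[\mathcal{I}]} + O(n^{-3/2}).$$

Expressions for the variance of $\mathcal{I}$ have been derived in [Hut01]:

$$\mathrm{Var}_t[\mathcal{I}] = \frac{1}{n+s}\sum_{ij} u_{ij}\left(\log\frac{u_{ij}}{u_{i+}u_{+j}}\right)^2 - \frac{1}{n+s}\left(\sum_{ij} u_{ij}\log\frac{u_{ij}}{u_{i+}u_{+j}}\right)^2 + O(n^{-2}).$$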

Higher order corrections to the variance and higher moments have also been derived, 
but are irrelevant in light of our other approximations. 
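The leading-order variance is cheap to evaluate from the matrix $u_{ij}$. The following sketch is illustrative only: the function name and the internal labels `j_sum` and `k_sum` are hypothetical, the $O(n^{-2})$ correction is dropped, and cells with $u_{ij} = 0$ are skipped using the convention $0 \log 0 = 0$.

```python
import math

def mi_variance_leading(u, n_plus_s):
    """Leading-order Var_t[I] = (K - J^2) / (n + s), dropping the O(n^-2) term.

    J = sum_ij u_ij * log(u_ij / (u_i+ u_+j)), K uses the squared log ratio.
    """
    rows = [sum(r) for r in u]
    cols = [sum(c) for c in zip(*u)]
    j_sum = k_sum = 0.0
    for i, row in enumerate(u):
        for j, uij in enumerate(row):
            if uij > 0.0:                          # convention: 0 log 0 = 0
                log_ratio = math.log(uij / (rows[i] * cols[j]))
                j_sum += uij * log_ratio
                k_sum += uij * log_ratio ** 2
    return (k_sum - j_sum ** 2) / n_plus_s

var_indep = mi_variance_leading([[0.25, 0.25], [0.25, 0.25]], 21.0)
var_dep = mi_variance_leading([[0.4, 0.1], [0.1, 0.4]], 21.0)
half_width = 1.96 * math.sqrt(var_dep)             # kappa * sqrt(Var) with kappa ~ 1.96
```

For an exactly independent table the leading term vanishes, so in that regime the neglected higher-order corrections dominate and this approximation should be used with care.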

9 Conclusions 

This is the first work providing a systematic approach for deriving closed form expressions for interval estimates for the Imprecise Dirichlet Model (IDM). We concentrated on exact and conservative robust interval ([lower, upper]) estimates for concave functions $F = \sum_i f_i$ on simplices, like the entropy. For the conservative estimates we used a first-order Taylor series expansion in one over the sample size $n$ and bounded the exact remainder, which widened the intervals by $O(n^{-2})$. This construction may work for other imprecise models too. Here is a dilemma, of course: For large $n$ the approximations are good, whereas for small $n$ the bounds are more interesting, so the approximations will be most useful for intermediate $n$. More precise expressions for small $n$ would be highly interesting. We have also indicated how to propagate robust estimates from simple functions to composite functions, like the mutual information. We argued that a reduced IDM on product spaces, like Bayesian nets, is more natural and should be preferred in order to improve predictions. Although the improvement is formally only $O(n^{-2})$, the difference may be significant in Bayes nets or for very small $n$. Finally, the basics of how to combine robust with credible intervals have been laid out. Under certain conditions $O(n^{-3/2})$ approximations can be derived, but the presented approximations are not conservative. All in all this work has shown that the IDM has not only interesting theoretical properties, but that explicit (exact/conservative/approximate) expressions for robust (credible) intervals for various quantities can be derived. The computational complexity of the derived bounds on $F = \sum_i f_i$ is very small, typically one or two evaluations of $F$ or related functions, like its derivative. First applications of these (or more precisely, very similar) results, especially the mutual information, to robust inference of trees look promising [ZH05].

Acknowledgements. I want to thank Peter Walley for introducing the IDM to 
me, Marco Zaffalon for encouraging me to investigate this topic, and Jean-Marc 
Bernard for his feedback on earlier drafts of the paper. 



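A Properties of the $\psi$ Function

The digamma function $\psi$ is defined as the logarithmic derivative of the Gamma function. Integral representations for $\psi$ and its derivatives are

$$\psi(z) = \frac{d\ln\Gamma(z)}{dz} = \frac{\Gamma'(z)}{\Gamma(z)} = \int_0^\infty \left[\frac{e^{-t}}{t} - \frac{e^{-zt}}{1-e^{-t}}\right] dt, \qquad \psi^{(k)}(z) = (-1)^{k+1}\int_0^\infty \frac{t^k e^{-zt}}{1-e^{-t}}\, dt.$$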

The $h$ function and its first derivative are

$$h(u_i) = (n_i + s t_i)[\psi(n+s+1) - \psi(n_i + s t_i + 1)]/(n+s),$$

$$h'(u_i) = \psi(n+s+1) - \psi(n_i + s t_i + 1) - (n_i + s t_i)\psi'(n_i + s t_i + 1).$$

For integral $s$ and at the arguments $u_i^0 = \frac{n_i}{n+s}$ and $u_i^0 + \sigma = \frac{n_i+s}{n+s}$ we need $\psi$ and $\psi'$ only at integer values, for which the following closed representations exist:

$$\psi(n+1) = -\gamma + \sum_{i=1}^n \frac{1}{i}, \qquad \psi'(n+1) = \frac{\pi^2}{6} - \sum_{i=1}^n \frac{1}{i^2},$$

where $\gamma = 0.5772156...$ is Euler's constant. Closed expressions for half-integer values and fast approximations for arbitrary arguments also exist. The following asymptotic expansion can be used if one is interested in $O((\frac{s}{n+s})^2)$ approximations only (and not rigorous bounds):

$$\psi(z+1) = \log z + \frac{1}{2z} - \frac{1}{12z^2} + O\left(\frac{1}{z^4}\right).$$

This shows that $h(u_i)$ converges to $-u_i \log u_i$ for $n \to \infty$ (and $u_i \to$ const.), i.e. $H(u)$ is close to $\mathcal{H}(u)$ for large $n$. See [AS74, Chp.6] for details on the $\psi$ function and its derivatives. From the above expressions one may show $h'' < 0$.
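The closed representations at integer arguments and the asymptotic expansion can be compared numerically. This is a minimal sketch with hypothetical function names; `h_int` evaluates $h$ at $u_i^0 + \sigma$ (i.e. $t_i = 1$) and assumes integer $n_i$, $n$ and $s$ so that only integer arguments of $\psi$ occur.

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(n):
    """psi(n + 1) = -gamma + sum_{i=1}^{n} 1/i for integer n >= 0."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n + 1))

def psi1_int(n):
    """psi'(n + 1) = pi^2/6 - sum_{i=1}^{n} 1/i^2 for integer n >= 0."""
    return math.pi ** 2 / 6.0 - sum(1.0 / i ** 2 for i in range(1, n + 1))

def psi_asymptotic(z):
    """psi(z + 1) ~ log z + 1/(2z) - 1/(12 z^2), error of order 1/z^4."""
    return math.log(z) + 1.0 / (2.0 * z) - 1.0 / (12.0 * z * z)

def h_int(n_i, n, s):
    """h at u_i = (n_i + s)/(n + s), i.e. t_i = 1, for integer n_i, n, s."""
    return (n_i + s) * (psi_int(n + s) - psi_int(n_i + s)) / (n + s)

gap = abs(psi_int(10) - psi_asymptotic(10.0))
u = 501.0 / 1001.0
entropy_gap = abs(h_int(500, 1000, 1) - (-u * math.log(u)))
```

Already at $z = 10$ the asymptotic expansion agrees with the exact value to about six decimal places, and for $n = 1000$ the value of $h$ is close to $-u \log u$, illustrating the stated convergence.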
B Symbols



Symbol                  Explanation

$\delta_{ij}$           Kronecker symbol ($\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \ne j$)
$i, \imath$             Discrete random variable, index/outcome/observation $\in \{1,...,d\}$
$d$                     Dimension of discrete random variable $\imath$
$\pi_i$                 (Objective/aleatory) probability/chance of $i$
$\log$                  Natural logarithm to basis $e$
$x_i, x, x_+$           Vector $x = (x_1,...,x_d)$, $x_+ = x_1 + ... + x_d$, $x \in \{n,t,u,\pi,...\}$
$t_i, t$                Initial bias of $i$, bias vector
$\Delta$                $= \{\pi : \pi_i \ge 0\ \forall i,\ \pi_+ = 1\}$ = $\pi$-simplex ($\pi \in \Delta$)
$\Delta_{(e)}$          $= \{t : t_i \ge 0\ \forall i,\ \sum_i t_i = 1\}$ = (extended) $t$-simplex ($t \in \Delta_{(e)}$)
$\Delta'_{(e)}$         $= \{u : u_i \ge u_i^0\ \forall i,\ \sum_i u_i = 1\}$ = (extended) $u$-simplex ($u \in \Delta'_{(e)}$)
$s$                     Magnitude of imprecision ($n'_i = s t_i$ is virtual observation #)
                        Data/sample $\{i_1,...,i_n\}$
$n_i, \mathbf{n}, n$    # of outcomes/observations $i$, sample vector, total sample size
$\delta(\cdot)$         Dirac delta distribution, $\int f(x)\delta(x)\,dx = f(0)$
$p(\pi|n)$              $\propto \prod_i \pi_i^{n_i + s t_i - 1}$ = Dirichlet posterior (second order/belief/subjective/epistemic probability)
$E_t[\mathcal{F}]$      Expected value of $\mathcal{F}$ w.r.t. posterior $p(\pi|n)$
w.r.t.                  with respect to
i.i.d.                  independent and identically distributed
$u_i^0$                 $= \frac{n_i}{n+s}$
$u^*, t^*$              Origin for Taylor expansion
$\sigma$                $= \frac{s}{n+s} = 1 - u^0_+$ = Taylor expansion parameter
$O(\sigma^k)$           $f(n,t,s) = O(\sigma^k)$ $:\Leftrightarrow$ $\exists c\ \forall n \in \mathbb{N}_0^d,\ t \in \Delta,\ s > 0 : |f(n,t,s)| \le c\sigma^k$
$\mathcal{H}(\pi)$      $= -\sum_i \pi_i \log \pi_i$ = entropy of $\pi$
$H(u)$                  $= \sum_i h(u_i)$ = expected entropy
$\mathcal{F}(\pi)$      Function of $\pi$ ($\mathcal{F} \in \{\mathcal{H}, \mathcal{I}, ...\}$)
$F(u)$                  Statistic $E_t[\mathcal{F}]$ or general function ($F \in \{H, I, ...\}$)
$F \sqsubseteq G$       $:\Leftrightarrow$ $F \le G$ and $F = G + O(\sigma^2)$, i.e. $G$ is "good" upper bound on $F$
$u^{\overline{F}}, t^{\overline{F}}$    maximize (and $u^{\underline{F}}, t^{\underline{F}}$ minimize) $F(u)$, $t \in \Delta$, $u \in \Delta'$
$\overline{F}$          $= \max_{t \in \Delta} F(u) = F(u^{\overline{F}})$ = upper value of $F(u)$, similarly $\underline{F}$
$\overline{\underline{F}}$    $= [\underline{F}, \overline{F}]$ = robust/Imprecise interval (estimate) of $F$
$F_0 + F_R(u)$          $= F(u)$ with $F_0 = F(u^0)$ and $F_R(u) = O(\sigma)$
$[\underline{F}_R^{lb}, \overline{F}_R^{ub}]$    $\sqsupseteq [\underline{F}_R, \overline{F}_R] \ni F_R$ (conservative [lower,upper] bound on $F_R$)
$\tilde{F}$             $= [\underline{\tilde{F}}, \overline{\tilde{F}}]$ = credible interval (estimate) of $F$
$u_{ij}, u_{i+}, u_{+j}$    joint, row, column marginal
$\mathcal{I}(\pi)$      $= \sum_{ij} \pi_{ij} \log\frac{\pi_{ij}}{\pi_{i+}\pi_{+j}}$ = mutual information of $\pi$
$I(u)$                  $= H(u_{i+}) + H(u_{+j}) - H(u_{ij}) = H_{row} + H_{col} - H_{joint}$
joint, row, col         Index for quantities based on joint, row, column marginal distr.